Matthew Talluto
21-22 Feb 2024
R is two things:
RStudio is a comprehensive working environment for R
(https://rstudio.com/products/rstudio/)
After launching Rstudio, create a new R script using the button in the upper left
Script: a text file where you will write an R-program. Commands in a script will be run in order, from the top to the bottom.
\(^1\)Mac users: usually you can substitute the command (⌘) key for control, and option for alt
# character is a comment and
will be ignored by the interpreter._ symbol.= or <- symbol= or
<-, not both)# Comments in R start with the # symbol
# anything after # will not be executed
# Comments are a useful way to annotate your code so you know what is
# happening
# Legal variable names
x = 1
y0 = 5
time_of_day = "20:15"
dayOfWeek <- "Monday"
# bad!
# d is the diversity in our site, in species
d = 8
# better!
site_diversity = 8
# error
0y = 5
todays date = "February 21"
## Error: <text>:21:2: unexpected symbol
## 20: # error
## 21: 0y
## ^numeric: integers and floating-point (decimal)
numberslogical: yes/no, true/false data; in R represented by
the special values TRUE and FALSE
T and F (no quotes) can be used as
shortcuts for TRUE/FALSE, but you should avoid
this!character: strings, textfactor: special variable type for categorical (nominal
& ordinal) data=, <-<-+, -, *, /,
^Mathematical & functions
sin(), cos()log(), exp(), sqrt()Get help on a function with ? or help(),
for example: ?log or help(log). If you don´t
know a function’s name, you can search for a (likely/suspected) string
in its name with ??.
x = 5
# The print() function takes one or more arguments
# in this case the variable x
# It returns no value, but has the side effect of printing the
# value of x to the screen
print(x)
## [1] 5
# compute the natural logarithm of x
# then store it in y, then print the result
y = log(x)
print(y)
## [1] 1.609438
# functions can take multiple arguments, and arguments can be named
# here we change the base of the logarithm
log(x, base = 10)
## [1] 0.69897
# equivalents, found in the help file with ?log
# log(x, 10)
# log10(x)vector holds one or more values of a single data
typec() is a simple function that concatenates
its arguments into a vector: operator creates a sequence of integers, counting
by onesseq function
rep# The c() function stands for concatenate
# it groups items together into a vector
(five_numbers = c(3, 2, 8.6, 4, 9.75))
## [1] 3.00 2.00 8.60 4.00 9.75
# Creating vectors
(one_to_ten = 1:10)
## [1] 1 2 3 4 5 6 7 8 9 10
class(one_to_ten)
## [1] "integer"
# integers are *also* numeric
is(one_to_ten, "numeric")
## [1] TRUE
## sequences and repeats
seq(1, 5, 0.4)
## [1] 1.0 1.4 1.8 2.2 2.6 3.0 3.4 3.8 4.2 4.6 5.0
rep(0, 3)
## [1] 0 0 0
# math on vectors is performed on each element
five_numbers + 1
## [1] 4.00 3.00 9.60 5.00 10.75
five_numbers^2
## [1] 9.0000 4.0000 73.9600 16.0000 95.0625
sin(five_numbers)
## [1] 0.1411200 0.9092974 0.7343971 -0.7568025 -0.3195192
# vectors of other data types
class(five_numbers)
## [1] "numeric"
(five_letters = c("c", "b", "z", "q", "w"))
## [1] "c" "b" "z" "q" "w"
class(five_letters)
## [1] "character"
sort(five_letters)
## [1] "b" "c" "q" "w" "z"
as(five_numbers, "integer")
## [1] 3 2 8 4 9[] operator[]!five_numbers
## [1] 3.00 2.00 8.60 4.00 9.75
five_numbers[1]
## [1] 3
five_numbers[3:5]
## [1] 8.60 4.00 9.75
five_numbers[c(1,4)]
## [1] 3 4
# get vector length
length(five_numbers)
## [1] 5
five_numbers[10]
## [1] NA
# can index with another variable
(i = length(five_numbers))
## [1] 5
five_numbers[i]
## [1] 9.75
# also an expression
five_numbers[i - 1]
## [1] 4mean(five_numbers)
## [1] 5.47
median(five_numbers)
## [1] 4
sd(five_numbers)
## [1] 3.479152
var(five_numbers)
## [1] 12.1045
sum(five_numbers)
## [1] 27.35
# mathematical functions
exp(five_numbers)
## [1] 20.085537 7.389056 5431.659591 54.598150 17154.228809
log(five_numbers)
## [1] 1.0986123 0.6931472 2.1517622 1.3862944 2.2772673TRUE and
FALSE (no quotes)==, !=, <,
>, <=, >=# less than, greater than, less/greater or equals
1 < 2
## [1] TRUE
class(1 < 2)
## [1] "logical"
2 <= 2
## [1] TRUE
2 > 4
## [1] FALSE
'q' >= 'm'
## [1] TRUE
# equality testing
2 == 1 + 1
## [1] TRUE
2 != 'q'
## [1] TRUE
# be careful with decimals!
0.2 + 0.1 == 0.3
## [1] FALSE
# preferred:
all.equal(0.1 + 0.2, 0.3)
## [1] TRUE& (and), | (or; the vertical bar),
! (not).# and: both must be true
TRUE & TRUE
## [1] TRUE
TRUE & FALSE
## [1] FALSE
FALSE & FALSE
## [1] FALSE
# or: only one needs to be true
TRUE | FALSE
## [1] TRUE
FALSE | FALSE
## [1] FALSE
# not: returns the opposite
!FALSE
## [1] TRUE
# both statements are TRUE, so this is TRUE
1 + 1 == 2 & 2 < 4
## [1] TRUEDownload the Palmer penguins dataset
Save this in your project folder, inside another folder named
data
Artwork by @allison_horst
read.csv function takes a file path or URL as the
argumentdata.frame"data/penguins.csv" is a relative
path.
"data"."/Users/ltalluto/Desktop/intro_r/data/penguins.csv"
Recommendation: use relative paths!
Hint: use .. to refer to one directory above the
current (working) directory
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
| Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
data.frame holds tabular data
row of a data frame is a single
case (i.e., an observation)column of a data frame is a single variable (i.e.,
a vector, all the same data type)head shows the first few rows of a data frame# Load a dataset named 'penguins' and convert it to a data frame
# this dataset comes from a package, "palmerpenguins"
data(penguins, package = "palmerpenguins")
penguins = as.data.frame(penguins)
head(penguins)| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
| Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
View shows the data frame in a graphical window (Or
double-click the dataset in the Environment tab)str gives you a summary of the
structure of the datastr(penguins)
## 'data.frame': 344 obs. of 8 variables:
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...nrow, ncol, dim give you
dimensions of the data frame$ or [], lets
you ‘peek’ inside the data.frame and access variables there.
# Error: there is no variable with that name!
bill_length_mm
## Error in eval(expr, envir, enclos): object 'bill_length_mm' not found# equivalent: accessing a variable by position
penguins[,3] # get every row in the third column
## [1] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42.0 37.8 37.8 41.1 38.6 34.6
## [16] 36.6 38.7 42.5 34.4 46.0 37.8 37.7 35.9 38.2 38.8 35.3 40.6 40.5 37.9 40.5
## [31] 39.5 37.2 39.5 40.9 36.4 39.2 38.8 42.2 37.6 39.8 36.5 40.8 36.0 44.1 37.0
## [46] 39.6 41.1 37.5 36.0 42.3 39.6 40.1 35.0 42.0 34.5 41.4 39.0 40.6 36.5 37.6
## [61] 35.7 41.3 37.6 41.1 36.4 41.6 35.5 41.1 35.9 41.8 33.5 39.7 39.6 45.8 35.5
## [76] 42.8 40.9 37.2 36.2 42.1 34.6 42.9 36.7 35.1 37.3 41.3 36.3 36.9 38.3 38.9
## [91] 35.7 41.1 34.0 39.6 36.2 40.8 38.1 40.3 33.1 43.2 35.0 41.0 37.7 37.8 37.9
## [106] 39.7 38.6 38.2 38.1 43.2 38.1 45.6 39.7 42.2 39.6 42.7 38.6 37.3 35.7 41.1
## [121] 36.2 37.7 40.2 41.4 35.2 40.6 38.8 41.5 39.0 44.1 38.5 43.1 36.8 37.5 38.1
## [136] 41.1 35.6 40.2 37.0 39.7 40.2 40.6 32.1 40.7 37.3 39.0 39.2 36.6 36.0 37.8
## [151] 36.0 41.5 46.1 50.0 48.7 50.0 47.6 46.5 45.4 46.7 43.3 46.8 40.9 49.0 45.5
## [166] 48.4 45.8 49.3 42.0 49.2 46.2 48.7 50.2 45.1 46.5 46.3 42.9 46.1 44.5 47.8
## [181] 48.2 50.0 47.3 42.8 45.1 59.6 49.1 48.4 42.6 44.4 44.0 48.7 42.7 49.6 45.3
## [196] 49.6 50.5 43.6 45.5 50.5 44.9 45.2 46.6 48.5 45.1 50.1 46.5 45.0 43.8 45.5
## [211] 43.2 50.4 45.3 46.2 45.7 54.3 45.8 49.8 46.2 49.5 43.5 50.7 47.7 46.4 48.2
## [226] 46.5 46.4 48.6 47.5 51.1 45.2 45.2 49.1 52.5 47.4 50.0 44.9 50.8 43.4 51.3
## [241] 47.5 52.1 47.5 52.2 45.5 49.5 44.5 50.8 49.4 46.9 48.4 51.1 48.5 55.9 47.2
## [256] 49.1 47.3 46.8 41.7 53.4 43.3 48.1 50.5 49.8 43.5 51.5 46.2 55.1 44.5 48.8
## [271] 47.2 NA 46.8 50.4 45.2 49.9 46.5 50.0 51.3 45.4 52.7 45.2 46.1 51.3 46.0
## [286] 51.3 46.6 51.7 47.0 52.0 45.9 50.5 50.3 58.0 46.4 49.2 42.4 48.5 43.2 50.6
## [301] 46.7 52.0 50.5 49.5 46.4 52.8 40.9 54.2 42.5 51.0 49.7 47.5 47.6 52.0 46.9
## [316] 53.5 49.0 46.2 50.9 45.5 50.9 50.8 50.1 49.0 51.5 49.8 48.1 51.4 45.7 50.7
## [331] 42.5 52.2 45.2 49.3 50.2 45.6 51.9 46.8 45.7 55.8 43.5 49.6 50.8 50.2$ or [], lets
you ‘peek’ inside the data.frame and access variables there.[]# Get the first value in the third column
penguins[1,3]
## [1] 39.1
# Get the whole third column, then use head to print the first six values
head(penguins[,3])
## [1] 39.1 39.5 40.3 NA 36.7 39.3
# First five values of columns 3 and 4
penguins[1:5, c(3,4)]
## bill_length_mm bill_depth_mm
## 1 39.1 18.7
## 2 39.5 17.4
## 3 40.3 18.0
## 4 NA NA
## 5 36.7 19.3
# Can also index by name
penguins[1:5, c("bill_length_mm", "bill_depth_mm")]
## bill_length_mm bill_depth_mm
## 1 39.1 18.7
## 2 39.5 17.4
## 3 40.3 18.0
## 4 NA NA
## 5 36.7 19.3
# Remove the first 300 rows
penguins[-(1:300),]
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 301 Chinstrap Dream 46.7 17.9 195 3300
## 302 Chinstrap Dream 52.0 19.0 197 4150
## 303 Chinstrap Dream 50.5 18.4 200 3400
## 304 Chinstrap Dream 49.5 19.0 200 3800
## 305 Chinstrap Dream 46.4 17.8 191 3700
## 306 Chinstrap Dream 52.8 20.0 205 4550
## 307 Chinstrap Dream 40.9 16.6 187 3200
## 308 Chinstrap Dream 54.2 20.8 201 4300
## 309 Chinstrap Dream 42.5 16.7 187 3350
## 310 Chinstrap Dream 51.0 18.8 203 4100
## 311 Chinstrap Dream 49.7 18.6 195 3600
## 312 Chinstrap Dream 47.5 16.8 199 3900
## 313 Chinstrap Dream 47.6 18.3 195 3850
## 314 Chinstrap Dream 52.0 20.7 210 4800
## 315 Chinstrap Dream 46.9 16.6 192 2700
## 316 Chinstrap Dream 53.5 19.9 205 4500
## 317 Chinstrap Dream 49.0 19.5 210 3950
## 318 Chinstrap Dream 46.2 17.5 187 3650
## 319 Chinstrap Dream 50.9 19.1 196 3550
## 320 Chinstrap Dream 45.5 17.0 196 3500
## 321 Chinstrap Dream 50.9 17.9 196 3675
## 322 Chinstrap Dream 50.8 18.5 201 4450
## 323 Chinstrap Dream 50.1 17.9 190 3400
## 324 Chinstrap Dream 49.0 19.6 212 4300
## 325 Chinstrap Dream 51.5 18.7 187 3250
## 326 Chinstrap Dream 49.8 17.3 198 3675
## 327 Chinstrap Dream 48.1 16.4 199 3325
## 328 Chinstrap Dream 51.4 19.0 201 3950
## 329 Chinstrap Dream 45.7 17.3 193 3600
## 330 Chinstrap Dream 50.7 19.7 203 4050
## 331 Chinstrap Dream 42.5 17.3 187 3350
## 332 Chinstrap Dream 52.2 18.8 197 3450
## 333 Chinstrap Dream 45.2 16.6 191 3250
## 334 Chinstrap Dream 49.3 19.9 203 4050
## 335 Chinstrap Dream 50.2 18.8 202 3800
## 336 Chinstrap Dream 45.6 19.4 194 3525
## 337 Chinstrap Dream 51.9 19.5 206 3950
## 338 Chinstrap Dream 46.8 16.5 189 3650
## 339 Chinstrap Dream 45.7 17.0 195 3650
## 340 Chinstrap Dream 55.8 19.8 207 4000
## 341 Chinstrap Dream 43.5 18.1 202 3400
## 342 Chinstrap Dream 49.6 18.2 193 3775
## 343 Chinstrap Dream 50.8 19.0 210 4100
## 344 Chinstrap Dream 50.2 18.7 198 3775
## sex year
## 301 female 2007
## 302 male 2007
## 303 female 2008
## 304 male 2008
## 305 female 2008
## 306 male 2008
## 307 female 2008
## 308 male 2008
## 309 female 2008
## 310 male 2008
## 311 male 2008
## 312 female 2008
## 313 female 2008
## 314 male 2008
## 315 female 2008
## 316 male 2008
## 317 male 2008
## 318 female 2008
## 319 male 2008
## 320 female 2008
## 321 female 2009
## 322 male 2009
## 323 female 2009
## 324 male 2009
## 325 male 2009
## 326 female 2009
## 327 female 2009
## 328 male 2009
## 329 female 2009
## 330 male 2009
## 331 female 2009
## 332 male 2009
## 333 female 2009
## 334 male 2009
## 335 male 2009
## 336 female 2009
## 337 male 2009
## 338 female 2009
## 339 female 2009
## 340 male 2009
## 341 female 2009
## 342 male 2009
## 343 male 2009
## 344 female 2009data.frame or
vector based on some conditionsubset function# Return a data frame with only one species
penguins_adelie = subset(penguins, species == "Adelie")
head(penguins_adelie)
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18.0 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## sex year
## 1 male 2007
## 2 female 2007
## 3 female 2007
## 4 <NA> 2007
## 5 female 2007
## 6 male 2007data.frame or
vector based on some conditionsubset function%in%
operatordata.frame or
vector based on some conditionsubset function%in%
operator[]# get the body mass of all female penguins with a larger-than-average bill length
penguins[penguins$sex == "female" & penguins$bill_length_mm > mean(penguins$bill_length_mm, na.rm = TRUE), "body_mass_g"]
## [1] NA 4500 4450 4550 4800 4650 4200 4800 5000 4400 NA 4600 5050 5150 4350
## [16] 4300 4200 5100 4850 4400 4900 4200 4400 4700 NA 4750 5200 4700 4600 4750
## [31] 4625 4725 4750 4875 4950 4750 4850 4875 4625 4850 4975 NA 5000 4375 NA
## [46] 4925 NA 4850 5200 3500 3525 3950 3250 4150 3800 3700 3575 3700 3450 3300
## [61] 3400 3700 3900 3850 2700 3650 3500 3675 3400 3675 3325 3600 3250 3525 3650
## [76] 3650 3775Read in the penguin data using read.csv. Store it in a
variable named penguins. Then try to do the following.
dim,
nrow, and ncol.mean(penguins$bill_depth_mm)?
Why? What if you try
mean(penguins$bill_depth_mm, na.rm = TRUE)? Check the help
file by typing ?mean. Does the same logic work for
sd?table function.c() inside [] to get multiple
columns. See the previous slide.body_mass_kg, and compute the
body mass, in kg instead of g.
$ to create a new column by namesubset to compute the mean bill length and bill
depth for Gentoo penguins## 1.
# dim gives rows and columns, in that order
nrow(penguins)
## [1] 344
ncol(penguins)
## [1] 8
dim(penguins)
## [1] 344 8
## 2.
# there are missing values, so this doesn't work
mean(penguins$bill_depth_mm)
## [1] NA
# adding na.rm = TRUE gives us what we expect
mean(penguins$bill_depth_mm, na.rm = TRUE)
## [1] 17.15117
sd(penguins$bill_depth_mm, na.rm = TRUE)
## [1] 1.974793
## 3.
table(penguins$year)
##
## 2007 2008 2009
## 110 114 120
## 4.
table(penguins[, c('species', 'year')])
## year
## species 2007 2008 2009
## Adelie 50 50 52
## Chinstrap 26 18 24
## Gentoo 34 46 44
## 5.
# nothing is printed! but the new column is created with the data already stored
# can confirm by looking at the data frame with head
penguins$body_mass_kg = penguins$body_mass_g / 1000
head(penguins)
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18.0 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## sex year body_mass_kg
## 1 male 2007 3.75
## 2 female 2007 3.80
## 3 female 2007 3.25
## 4 <NA> 2007 NA
## 5 female 2007 3.45
## 6 male 2007 3.65
## 6.
penguins_gentoo = subset(penguins, species == "Gentoo")
mean(penguins_gentoo$bill_length_mm, na.rm = TRUE)
## [1] 47.50488
mean(penguins_gentoo$bill_depth_mm, na.rm = TRUE)
## [1] 14.98211Our data frame has some missing data. Let’s remove the incomplete
cases using the complete.cases function.
mean, sd, varquantile, range, sumcomplete.casesspecies, sex, year,
islandtable allows us to count cases across partitionstapply (apply a function in a table) is a
generalisation of table, allows us to compute any statistic
across partitions
# one variable, one partition
tapply(penguins$bill_length_mm, penguins$species, mean, na.rm = TRUE)
## Adelie Chinstrap Gentoo
## 38.82397 48.83382 47.56807
# one variable, two partitions
tapply(penguins$bill_length_mm, penguins[, c('species', 'sex')],
mean, na.rm = TRUE)
## sex
## species female male
## Adelie 37.25753 40.39041
## Chinstrap 46.57353 51.09412
## Gentoo 45.56379 49.47377table allows us to count cases across partitionstapply (apply a function in a table) is a
generalisation of table, allows us to compute any statistic
across partitions
sapply (simple apply) applies a function to multiple
variables but with no partitions
boxplot
function. Try the following syntax:
boxplot(y ~ group1 + group2, data = penguins).
y, substitute the name of the variable inside
penguins that you want to plot (e.g.,
flipper_length_mm).group1 and group2, substitite the
names of any grouping variables you want to use, like
sex.xlab and ylab, the axis titlesborder, col, the colors of the box borders
and the fill colornotch, boxwex (see if you can figure them
out from the help file, or by experimenting)# Subset the data to look only at Adelie penguins.
penguins_adelie = subset(penguins, species == "Adelie")
# Does the distribution of flipper length of Adelie penguins vary
# among the different islands?
# we can use tapply with the different functions
## mean
tapply(penguins_adelie$flipper_length_mm,
penguins_adelie[, c('island', 'sex')], mean)
## sex
## island female male
## Biscoe 187.1818 190.4091
## Dream 187.8519 191.9286
## Torgersen 188.2917 194.9130
## sd
tapply(penguins_adelie$flipper_length_mm,
penguins_adelie[, c('island', 'sex')], sd)
## sex
## island female male
## Biscoe 6.744567 6.463517
## Dream 5.510156 6.803749
## Torgersen 4.638958 5.915412
## median
tapply(penguins_adelie$flipper_length_mm,
penguins_adelie[, c('island', 'sex')], median)
## sex
## island female male
## Biscoe 187 191.0
## Dream 188 190.5
## Torgersen 189 195.0
## min
tapply(penguins_adelie$flipper_length_mm,
penguins_adelie[, c('island', 'sex')], min)
## sex
## island female male
## Biscoe 172 180
## Dream 178 178
## Torgersen 176 181
## max
tapply(penguins_adelie$flipper_length_mm,
penguins_adelie[, c('island', 'sex')], max)
## sex
## island female male
## Biscoe 199 203
## Dream 202 208
## Torgersen 196 210# a boxplot is a much nicer way to visualise this
# the defaults are quite ugly, let's improve things
# I chose colours using colorbrewer
# https://colorbrewer2.org/#type=qualitative&scheme=Paired&n=7
# 3 colors, one for each island
cols = c("#a6cee3", "#fb9a99", "#fdbf6f")
boxplot(flipper_length_mm ~ island+sex, data = penguins_adelie,
at = c(1.6,2,2.4, 3.6,4,4.4), # at controls x-location, allows grouping by sex
col = rep(cols, 2), # set the colours, repeated twice (female and male)
names = c("", "Female", "", "", "Male", ""), # labels under the boxes
xlab = "", ylab = "Flipper length (mm)", # set axis titles
bty = 'n', #disable the box around the plot
notch = TRUE,
boxwex = 0.3 # thinner boxes
)
# add a legend to the plot
legend("topleft", legend = unique(penguins_adelie$island),
title = "Island", fill = cols, bty = 'n')R currently has two dominant graphical engines
Defaults are not amazing, some improvements needed
main = "" disables the titlexlab and ylab control axis labelsbreaks controls the number of bins in the
histogramcol sets the color of the barsborder sets the color of the borders (NA:
no border)#RRGGBB
00 (none) to FF
(most)col = 'rosybrown'colors() function for the namesscales package to generate custom
palettes.t.testaov to do an analysis of
variance when there are more than 2 groups or when there are more than
two factors.t.test to test the hypothesis. Check the help file
to get started.geom_violin) visualising the differences in body mass among
these 2 different factor variables.aov to perform an analysis of variance for each
hypothesis. What do you conclude?Hints
y ~ x (y is modelled by x). This also works with
t.test!aov doesn’t produce very nice output by itself. Try
saving the result of aov in a variable and using
summary on that variable.Base graphics
Easiest just to do this in 2 panels. I set the axis limits so we have a fair comparison.
# create a multipanel figure, organised by rows, with 1 row and 2 colums
par(mfrow = c(1,2))
hist(subset(penguins_gentoo, sex == "female")$body_mass_g,
main = "Female Gentoo", col = 'pink',
border = NA, xlab = "Body Mass (g)", xlim = c(3800, 6500))
hist(subset(penguins_gentoo, sex == "male")$body_mass_g,
main = "Male Gentoo", col = 'lightblue',
border = NA, xlab = "Body Mass (g)", xlim = c(3800, 6500))GGplot
ggplot allows for fancier options
alternative = "less"?##
## Welch Two Sample t-test
##
## data: body_mass_g by sex
## t = -14.761, df = 116.64, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group female and group male is less than 0
## 95 percent confidence interval:
## -Inf -714.6651
## sample estimates:
## mean in group female mean in group male
## 4679.741 5484.836
library(RColorBrewer)
(cols = brewer.pal(3, "Paired"))#[c(1, 5, 7)])
## [1] "#A6CEE3" "#1F78B4" "#B2DF8A"
boxplot(body_mass_g ~ species+sex, data = penguins,
# at controls x-location, allows grouping by sex
at = c(1.6,2,2.4, 3.6,4,4.4),
# set the colours, repeated twice (female and male)
col = rep(cols, 2),
# axis label text size
cex.axis = 0.8,
# labels under the boxes
names = c("", "Female", "", "", "Male", ""),
# set axis titles
xlab = "", ylab = "Body mass (g)",
# disable the box around the plot
bty = 'n',
notch = TRUE,
# thinner boxes
boxwex = 0.3
)
# add a legend to the plot
legend("topleft", legend = unique(penguins$species),
title = "Species", fill = cols, bty = 'n')cor function to compute correlations
among 2 or more variablescor function using all 4 numeric variables.cor.testplot to produce a scatterplot of the 2 variables.
Customize the plot to make it look a bit bettercor function to compute correlations
among 2 or more variablescor function using all 4 numeric variables.# I use round to make the printing nicer
round(cor(penguins[, 3:6]), digits = 3)
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm 1.000 -0.229 0.653 0.589
## bill_depth_mm -0.229 1.000 -0.578 -0.472
## flipper_length_mm 0.653 -0.578 1.000 0.873
## body_mass_g 0.589 -0.472 0.873 1.000cor function to compute correlations
among 2 or more variablescor function using all 4 numeric variables.cor.test# I chose the weakest correlation
cor(penguins$bill_depth_mm, penguins$bill_length_mm)
## [1] -0.2286256
cor.test(penguins$bill_depth_mm, penguins$bill_length_mm )
##
## Pearson's product-moment correlation
##
## data: penguins$bill_depth_mm and penguins$bill_length_mm
## t = -4.2726, df = 331, p-value = 2.528e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3280409 -0.1242017
## sample estimates:
## cor
## -0.2286256# magic spell to give each species a different color
cols = factor(penguins$species)
plot(penguins$bill_depth_mm, penguins$bill_length_mm,
pch = 16, #dots instead of circles
col = cols, xlab = "Bill depth (mm)",
ylab = "Bill length(mm)", bty = 'n'
)
legend("topleft", legend = levels(cols), fill = 1:3, bty = "n")lm function
aov, use the summary function to get
a better look at the resultsanova function can also be useful for models with
categorical predictorsmod_billdepth = lm(bill_depth_mm ~ bill_length_mm * species + bill_length_mm * sex, data = penguins)
summary(mod_billdepth)
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm * species + bill_length_mm *
## sex, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0740 -0.5208 -0.0642 0.4621 2.9887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.25509 1.08618 14.965 < 2e-16 ***
## bill_length_mm 0.03677 0.02854 1.289 0.198476
## speciesChinstrap -3.84651 1.93284 -1.990 0.047419 *
## speciesGentoo -6.62661 1.75771 -3.770 0.000194 ***
## sexmale 2.21069 0.91051 2.428 0.015726 *
## bill_length_mm:speciesChinstrap 0.07512 0.04330 1.735 0.083688 .
## bill_length_mm:speciesGentoo 0.06389 0.04068 1.571 0.117272
## bill_length_mm:sexmale -0.02183 0.02070 -1.054 0.292466
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8182 on 325 degrees of freedom
## Multiple R-squared: 0.831, Adjusted R-squared: 0.8274
## F-statistic: 228.3 on 7 and 325 DF, p-value: < 2.2e-16
anova(mod_billdepth)
## Analysis of Variance Table
##
## Response: bill_depth_mm
## Df Sum Sq Mean Sq F value Pr(>F)
## bill_length_mm 1 67.30 67.30 100.5327 <2e-16 ***
## species 2 920.55 460.27 687.6071 <2e-16 ***
## sex 1 79.72 79.72 119.0908 <2e-16 ***
## bill_length_mm:species 2 1.60 0.80 1.1982 0.3031
## bill_length_mm:sex 1 0.74 0.74 1.1118 0.2925
## Residuals 325 217.55 0.67
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1table function to view the countsbarplot for visualisingprop.test for the significance testtable function to view the countsbarplot for visualisingprop.test for the significance test# save it in a variable so we can use the counts later
tab = table(penguins_gentoo$sex)
tab
##
## female male
## 58 61
prop.test(tab)
##
## 1-sample proportions test with continuity correction
##
## data: tab, null probability 0.5
## X-squared = 0.033613, df = 1, p-value = 0.8545
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.3953483 0.5802663
## sample estimates:
## p
## 0.487395year is associated with the
variable species using a chi-squared
test.table and barplot for
exploration.chisq.test for the testwoodpecker = read.csv("https://raw.githubusercontent.com/flee-group/vu_datenanalyse/main/datasets/woodpecker.csv")
head(woodpecker)
## forest_type nest_tree
## 1 burned birch
## 2 burned birch
## 3 burned jack pine
## 4 burned aspen
## 5 burned birch
## 6 burned jack pine
table(woodpecker)
## nest_tree
## forest_type aspen birch jack pine
## burned 6 16 2
## unburned 29 18 23We want to test for an association between the two variables (forest type and nest tree)
\(H_0\): Nesting tree is not associated with forest type
\(H_A\): Nest tree is associated with forest type
chisq.test on these data.barplot or geom_bar to visualiseBarplots, or proportional bars for counts within categories
table(woodpecker)
## nest_tree
## forest_type aspen birch jack pine
## burned 6 16 2
## unburned 29 18 23
woodp_plot = ggplot(woodpecker, aes(x = nest_tree,
fill = forest_type))
woodp_plot = woodp_plot + geom_bar(width = 0.5,
position=position_dodge())
woodp_plot = woodp_plot + xlab("Nest Tree Type") +
theme_minimal() + labs(fill = "Forest Type")
woodp_plotSide-by-side bars allow us to compare all categories on equal footing.